Introduction to Open Data Science - Course Project

About the Project

This is the Introduction to Open Data Science course project. I'm happy to learn to connect Git and R, even though the content of the data analysis is familiar to me. I've used R Markdown before and found it useful. I've also used Git before, on the Tietokantojen Perusteet course; however, it's been a long time, so I need to refresh my memory. Here's the link to my IODS GitHub repository.

Today is

## [1] "Mon Nov 16 16:15:13 2020"

Reminders

I wanted to remind myself what I did last time with R Markdown. I found a nice dataset on exchange and forward rates. I make a table and a graph of the GBP/EUR exchange rate. I uploaded forward2.dat to the Git repository and read the dataset from there in order to plot the rate. I hide the R code to make the output more readable.

Table of the exchange and forward rates

|         | EXUSBP | EXUSEUR | EXEURBP | F1USBP | F1USEUR | F1EURBP | F3USBP | F3USEUR | F3EURBP |
|---------|--------|---------|---------|--------|---------|---------|--------|---------|---------|
| Min.    | 1.073  | 0.5827  | 0.3567  | 1.067  | 0.5873  | 0.3588  | 1.061  | 0.5941  | 0.3624  |
| 1st Qu. | 1.507  | 0.8876  | 0.4855  | 1.504  | 0.8885  | 0.4845  | 1.501  | 0.8926  | 0.4841  |
| Median  | 1.617  | 1.0738  | 0.5598  | 1.616  | 1.0774  | 0.5589  | 1.609  | 1.0855  | 0.5588  |
| Mean    | 1.665  | 1.0416  | 0.6213  | 1.663  | 1.0447  | 0.6203  | 1.658  | 1.0503  | 0.6184  |
| 3rd Qu. | 1.756  | 1.1741  | 0.7107  | 1.753  | 1.1778  | 0.7091  | 1.748  | 1.1819  | 0.7051  |
| Max.    | 2.443  | 1.4222  | 1.6002  | 2.441  | 1.4240  | 1.5954  | 2.433  | 1.4278  | 1.5863  |

Graph of the GBP/EUR exchange rate.

LaTeX

I’m happy to see that R MarkDown is linked to LaTeX syntax! Here’s a maximization problem from my current research on the optimal mechanism design with enforcement: \[\begin{align} \max_{r(\cdot),t(\cdot),m(\cdot)} \mathbb{E} \left[ t(\theta) - K m(\theta) \right]\label{OBJ} \tag{MAX} \end{align}\] subject to \[\begin{align} &t(\theta) = \theta r(\theta) - V(\underline{\theta}) - \int_{\underline{\theta}}^{\theta} \mathcal{I}(s|s)ds \label{TAX} \tag{TAX} \\ &\mathcal{I}(\theta|\theta) \qquad \text{is nondecreasing} \label{IC} \tag{IC} \\ &\mathcal{I}(\theta|\theta) \geq 0 \label{IR} \tag{IR} \end{align}\] for all \(\theta \in \Theta\) with \(\mathcal{I}(\theta|\theta):=r(\theta) - m(\theta) \varphi\).


Regression and Model Validation

Today is

## [1] "Mon Nov 16 16:15:14 2020"

This week we learn how to create a dataset and analyse it. We read the data from the given URL. The meta descriptions can be found here.

The dataset examines the relationship between learning approaches and students' achievements in an introductory statistics course in Finland. The original data have 183 respondents (observations) and 97 questions (variables). For our purposes, we subset the dataset to the variables gender, age, attitude, deep, stra, surf and points. The variables are either integers or numerics. The meta-file describes each variable.

Next we omit the observations where the exam points variable is zero. After that we scale the combination variables to means, that is, we divide each combination variable by the number of questions it combines. This leaves 166 observations of 7 variables.
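As a sketch of those wrangling steps (the question columns D1 and D2 below are hypothetical stand-ins for the real question variables, which follow the course's naming scheme):

```r
# Toy stand-in for the raw data: three respondents, two "deep" questions
# (D1 and D2 are hypothetical names) and exam points
raw <- data.frame(gender = c(1, 2, 2), Age = c(20, 25, 30),
                  D1 = c(4, 5, 3), D2 = c(2, 3, 5),
                  Points = c(20, 0, 25))

# Combination variable = mean over its questions
# (sum divided by the number of questions asked)
raw$Deep <- rowMeans(raw[, c("D1", "D2")])

# Omit respondents with zero exam points
learning <- subset(raw, Points > 0)
learning$Deep
# [1] 3 4
```

The same pattern (rowMeans over a question group, then a filter on Points) produces the attitude, deep, stra and surf columns of the actual dataset.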

Descriptive Analysis

Let us next read the data we created earlier in the data wrangling part:

# Read the data
learning2014 <- read.csv("/Users/teemupekkarinen/IODS-project/data/learning2014",row.names = 1)

To summarize our dataset’s content we run

# Structure
str(learning2014)
## 'data.frame':    166 obs. of  7 variables:
##  $ Gender  : int  2 1 2 1 1 2 1 2 1 2 ...
##  $ Age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ Attitude: num  3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
##  $ Deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ Stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ Surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ Points  : int  25 12 24 10 22 21 21 31 24 26 ...
# Dimension
dim(learning2014)
## [1] 166   7

That is, we have information about the gender, age, attitude, and three different learning approaches (deep, strategic, and surface) of 166 respondents. The combination variables (attitude, deep, stra, and surf) are means over their question groups. For a more detailed description we run:

# Summary
summary(learning2014)
##      Gender           Age           Attitude          Deep      
##  Min.   :1.000   Min.   :17.00   Min.   :1.400   Min.   :1.583  
##  1st Qu.:1.000   1st Qu.:21.00   1st Qu.:2.600   1st Qu.:3.333  
##  Median :2.000   Median :22.00   Median :3.200   Median :3.667  
##  Mean   :1.663   Mean   :25.51   Mean   :3.143   Mean   :3.680  
##  3rd Qu.:2.000   3rd Qu.:27.00   3rd Qu.:3.700   3rd Qu.:4.083  
##  Max.   :2.000   Max.   :55.00   Max.   :5.000   Max.   :4.917  
##       Stra            Surf           Points     
##  Min.   :1.250   Min.   :1.583   Min.   : 7.00  
##  1st Qu.:2.625   1st Qu.:2.417   1st Qu.:19.00  
##  Median :3.188   Median :2.833   Median :23.00  
##  Mean   :3.121   Mean   :2.787   Mean   :22.72  
##  3rd Qu.:3.625   3rd Qu.:3.167   3rd Qu.:27.75  
##  Max.   :5.000   Max.   :4.333   Max.   :33.00

We observe that there are almost twice as many female respondents as male: 56 men and 110 women. The average age of the respondents is about 26; the youngest person is 17 years old and the oldest 55. The age distribution is wide but skewed towards young people. In summary, a typical respondent is a young female.

# Histograms
Gender <- learning2014$Gender
hist(Gender,breaks=2)

sum(Gender==1)
## [1] 56
Age <- learning2014$Age
hist(Age)

mean(Age)
## [1] 25.51205
var(Age)
## [1] 60.31198

The average exam score is about 23, with a maximum of 33 and a minimum of 7. The exam results are more or less normally distributed among all participants.

# Histograms
Points <- learning2014$Points
hist(Points)

However, when we compare the points between men and women, we observe that the mean of exam points for men is slightly greater than for women. There are also more men than women with a score of 30 or 31.

# Package
library(ggplot2)

# Histograms
PointsMale <- learning2014$Points[Gender==1]
mean(PointsMale)
## [1] 23.48214
PointsFemale <- learning2014$Points[Gender==2]
mean(PointsFemale)
## [1] 22.32727
data <- data.frame(
  type = c( rep("Male", 56), rep("Female", 110) ),
  value = c(PointsMale, PointsFemale)
)
ggplot(data, aes(x=value, fill=type)) +
    geom_histogram(binwidth=.5, alpha=.5, position="identity")

The mean of attitude, 3.14, is close to the neutral value 3. That is, on average respondents have a slightly positive attitude towards statistics. Attitude is also nearly normally distributed. Male respondents have a more positive attitude than female respondents.

# Histograms
Attitude <- learning2014$Attitude
hist(Attitude)

AttitudeMale <- learning2014$Attitude[Gender==1]
mean(AttitudeMale)
AttitudeFemale <- learning2014$Attitude[Gender==2]
mean(AttitudeFemale)
data <- data.frame(
  type = c( rep("Male", 56), rep("Female", 110) ),
  value = c(AttitudeMale, AttitudeFemale)
)
ggplot(data, aes(x=value, fill=type)) +
    geom_histogram(binwidth=.5, alpha=.5, position="identity")

A minority prefers the surface learning approach in the statistics course, whereas the majority prefers deep learning methods. Moreover, the greatest variation across learning methods is in strategic learning. Deep and surface learning are almost identically distributed, except that deep learning has a greater mean.

# Package
library(ggplot2)

# Histograms
Deep <- learning2014$Deep
Stra <- learning2014$Stra
Surf <- learning2014$Surf
data <- data.frame(
  type = c( rep("Deep", 166), rep("Stra", 166), rep("Surf", 166) ),
  value = c(Deep, Stra, Surf)
)
ggplot(data, aes(x=value, fill=type)) +
    geom_histogram(binwidth=.5, alpha=.5, position="identity")

There is not much difference between genders in learning methods; the distributions for male and female respondents are almost identical.

# Histograms

# Strategic Learning
StraMale <- learning2014$Stra[Gender==1]
StraFemale <- learning2014$Stra[Gender==2]
data <- data.frame(
  type = c( rep("Male", 56), rep("Female", 110) ),
  value = c(StraMale, StraFemale)
)
ggplot(data, aes(x=value, fill=type)) +
    geom_histogram(binwidth=.5, alpha=.5, position="identity")

# Deep Learning
DeepMale <- learning2014$Deep[Gender==1]
DeepFemale <- learning2014$Deep[Gender==2]
data <- data.frame(
  type = c( rep("Male", 56), rep("Female", 110) ),
  value = c(DeepMale, DeepFemale)
)
ggplot(data, aes(x=value, fill=type)) +
    geom_histogram(binwidth=.5, alpha=.5, position="identity")

# Surface Learning
SurfMale <- learning2014$Surf[Gender==1]
SurfFemale <- learning2014$Surf[Gender==2]
data <- data.frame(
  type = c( rep("Male", 56), rep("Female", 110) ),
  value = c(SurfMale, SurfFemale)
)
ggplot(data, aes(x=value, fill=type)) +
    geom_histogram(binwidth=.5, alpha=.5, position="identity")

Regressions

Next we run regressions with exam points as the dependent variable. Let us first look at the relationships between exam points and the explanatory variables.

plot(Attitude, Points, main = "Regression",
     xlab = "Attitude", ylab = "Points")
abline(lm(Points~Attitude), col = "blue")

plot(Gender, Points, main = "Regression",
     xlab = "Gender", ylab = "Points")
abline(lm(Points~Gender), col = "blue")

plot(Age, Points, main = "Regression",
     xlab = "Age", ylab = "Points")
abline(lm(Points~Age), col = "blue")

plot(Deep, Points, main = "Regression",
     xlab = "Deep Learning", ylab = "Points")
abline(lm(Points~Deep), col = "blue")

plot(Stra, Points, main = "Regression",
     xlab = "Strategic Learning", ylab = "Points")
abline(lm(Points~Stra), col = "blue")

plot(Surf, Points, main = "Regression",
     xlab = "Surface Learning", ylab = "Points")
abline(lm(Points~Surf), col = "blue")
cov(learning2014)
##               Gender        Age    Attitude        Deep        Stra        Surf
## Gender    0.22489960 -0.4383352 -0.10184739 -0.01526713  0.05326762  0.02826457
## Age      -0.43833516 60.3119752  0.12584520  0.10792260  0.61406079 -0.58075332
## Attitude -0.10184739  0.1258452  0.53276561  0.04458987  0.03478322 -0.06776013
## Deep     -0.01526713  0.1079226  0.04458987  0.30706766  0.04127419 -0.09489017
## Stra      0.05326762  0.6140608  0.03478322  0.04127419  0.59572437 -0.06570525
## Surf      0.02826457 -0.5807533 -0.06776013 -0.09489017 -0.06570525  0.27967222
## Points   -0.25972983 -4.2662651  1.87824388 -0.03315078  0.66483662 -0.45002434
##               Points
## Gender   -0.25972983
## Age      -4.26626506
## Attitude  1.87824388
## Deep     -0.03315078
## Stra      0.66483662
## Surf     -0.45002434
## Points   34.74965316

From the single regressions and the covariance matrix we observe a negative covariance (and hence a negative correlation) between points and gender, age, deep learning, and surface learning. That is, as we already observed, men did better in the course than women. Moreover, younger students seem to have got better exam results than older participants. Strategic learning was the only learning method with a positive correlation with exam results. There is a large positive covariance between attitude and points.

Next we first choose gender, attitude and age as regressors for exam points.

reg <- lm(Points~Gender+Attitude+Age)
summary(reg)
## 
## Call:
## lm(formula = Points ~ Gender + Attitude + Age)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.4590  -3.3221   0.2186   4.0247  10.4632 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 12.76802    3.17695   4.019 8.93e-05 ***
## Gender       0.33054    0.91934   0.360    0.720    
## Attitude     3.60657    0.59322   6.080 8.34e-09 ***
## Age         -0.07586    0.05367  -1.414    0.159    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.315 on 162 degrees of freedom
## Multiple R-squared:  0.2018, Adjusted R-squared:  0.187 
## F-statistic: 13.65 on 3 and 162 DF,  p-value: 5.536e-08

From here we observe that attitude is the only explanatory variable that is significant by itself. However, by the F-test all three variables are jointly significant.

Nevertheless, following the instructions, we drop the gender variable and run the regression again.

# Regression
reg <- lm(Points~Attitude+Age)
summary(reg)
## 
## Call:
## lm(formula = Points ~ Attitude + Age)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.3354  -3.3095   0.2625   4.0005  10.4911 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.57244    2.24943   6.034 1.04e-08 ***
## Attitude     3.54392    0.56553   6.267 3.17e-09 ***
## Age         -0.07813    0.05315  -1.470    0.144    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.301 on 163 degrees of freedom
## Multiple R-squared:  0.2011, Adjusted R-squared:  0.1913 
## F-statistic: 20.52 on 2 and 163 DF,  p-value: 1.125e-08

Attitude remains the strongest predictor. Age is still insignificant, even though together with attitude it is jointly significant (F-test). It turns out, after trying all possible combinations of explanatory variables, that only when regressing on attitude alone are all regressors significant at the 5 % level. We thus lastly regress only on attitude and get the following.

reg <- lm(Points~Attitude)
summary(reg)
## 
## Call:
## lm(formula = Points ~ Attitude)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.6372     1.8303   6.358 1.95e-09 ***
## Attitude      3.5255     0.5674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09

From the summary we observe that the intercept term is positive and significant: 11.64 with a standard error of 1.83. This is the predicted test score if attitude were "zero". Each unit increase in attitude is associated with (not causally) an increase of about 3.5 points in the exam result. Hence, with the maximum attitude our model predicts a test result of \(11.64 + 5 \times 3.53 \approx 29\) points.
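Using the coefficients reported above, this prediction can be checked directly; with the full model object one would instead call `predict(reg, newdata = data.frame(Attitude = 5))`. A minimal sketch:

```r
# coefficients copied from summary(reg) above
coefs <- c(intercept = 11.6372, attitude = 3.5255)

# predicted exam score at the maximum attitude value 5
unname(coefs["intercept"] + 5 * coefs["attitude"])
# [1] 29.2647
```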

We also observe that R-squared decreases a little every time we omit an explanatory variable from the regression. R-squared measures the proportion of the variance of the dependent variable that is explained by the explanatory variables. The adjusted R-squared is a modified version of R-squared that penalizes for the number of predictors in the model.
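The adjustment is explicit in the standard formula: with \(n\) observations and \(K\) regressors (excluding the intercept), \[\bar{R}^2 = 1 - (1 - R^2)\frac{n-1}{n-K-1},\] which matches the reported values; for instance, the attitude-only model gives \(1 - (1 - 0.1906) \cdot 165/164 \approx 0.186\).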

Lastly we plot the regression diagnostic plots: Residuals vs Fitted values, Normal QQ-plot and Residuals vs Leverage.

plot(reg)

The Residuals vs Fitted values figure is a simple scatterplot of residuals against predicted values. It should look more or less random, which is roughly the case in our analysis. The red line is not exactly zero everywhere, which is a sign that the residuals may retain a little predictive power (omitted variable bias).

The normal QQ plot helps us assess whether the residuals are roughly normally distributed. In our case the tails depart from the diagonal, which may imply problems with the assumption of normally distributed error terms. A t-distribution might fit better.

The last plot, residuals vs leverage (with Cook's distance), tells us which points have the greatest influence on the regression (leverage points). It turns out that observations 35, 56, and 71 have the strongest effects on the fit.

Causal Inference

For causal inference we need very strong assumptions.

Before going to the assumptions, we introduce the vector and matrix notation. Define K-dimensional (column) vectors \(\boldsymbol{x}_i\) and \(\boldsymbol{\beta}\) as

\[ \underset{(K \times 1)}{\boldsymbol{x}_i} = \begin{pmatrix} x_{i1} \\ x_{i2} \\ \vdots \\ x_{iK} \end{pmatrix}, \underset{(K \times 1)}{\boldsymbol{\beta}} = \begin{pmatrix} \beta_{1} \\ \beta_{2} \\ \vdots \\ \beta_{K} \end{pmatrix}.\] Also define \[ \underset{(n \times 1)}{\boldsymbol{y}} = \begin{pmatrix} y_{1} \\ \vdots \\ y_{n} \end{pmatrix}, \underset{(n \times 1)}{\boldsymbol{\varepsilon}} = \begin{pmatrix} \varepsilon_{1} \\ \vdots \\ \varepsilon_{n} \end{pmatrix}, \underset{(n \times K)}{\boldsymbol{X}} = \begin{pmatrix} \boldsymbol{x}'_{1} \\ \vdots \\ \boldsymbol{x}'_{n} \end{pmatrix} = \begin{pmatrix} x_{11} & \dots & x_{1K} \\ \vdots & \ddots & \vdots \\ x_{n1} & \dots & x_{nK} \end{pmatrix}. \] Scalar quantities will be given in normal font \(x\), vector quantities in bold lowercase \(\boldsymbol{x}\), and all vectors will be presumed to be column vectors. Matrix quantities will be in bold uppercase \(\boldsymbol{X}\). The transpose of a matrix is denoted by either \(\boldsymbol{X}'\) or \(\boldsymbol{X}^T\).

Using the matrix notation, the assumptions for the causal inference are the following

Assumption 1.1 (linearity): \[\underset{(n \times 1)}{\boldsymbol{y}} = \underset{\underbrace{(n \times K)(K \times 1)}_{(n \times 1)}}{\boldsymbol{X} \: \: \: \boldsymbol{\beta}} + \underset{(n \times 1)}{\boldsymbol{\varepsilon}}.\]

Assumption 1.2 (strict exogeneity): \[\mathbb{E}(\varepsilon_{i}|\boldsymbol{X}) = 0 \: \: \: (i = 1, 2, \dots, n).\]

Assumption 1.3 (no multicollinearity): The rank of the \(n \times K\) data matrix, \(\boldsymbol{X}\), is \(K\) with probability 1.

Assumption 1.4 (spherical error variance): \[(\text{homoskedasticity}) \: \: \mathbb{E}(\varepsilon_{i}^2|\boldsymbol{X}) = \sigma^2 > 0 \: \: \: (i = 1, 2, \dots, n),\] \[(\text{no correlation between observations}) \: \: \: \mathbb{E}(\varepsilon_{i}\varepsilon_{j}|\boldsymbol{X}) = 0 \: \: \: (i, j = 1, 2, \dots, n; i \neq j).\]
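In this notation, the standard OLS estimator and its first two conditional moments under Assumptions 1.1-1.4 are \[\hat{\boldsymbol{\beta}} = (\boldsymbol{X}'\boldsymbol{X})^{-1}\boldsymbol{X}'\boldsymbol{y}, \qquad \mathbb{E}(\hat{\boldsymbol{\beta}}|\boldsymbol{X}) = \boldsymbol{\beta}, \qquad \text{Var}(\hat{\boldsymbol{\beta}}|\boldsymbol{X}) = \sigma^2 (\boldsymbol{X}'\boldsymbol{X})^{-1},\] where Assumption 1.3 guarantees that \(\boldsymbol{X}'\boldsymbol{X}\) is invertible.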

Even though our regression diagnostic plots are promising, I would not make any kind of causal interpretations from our model.


Clustering and Classification

Today is

## [1] "Mon Nov 16 16:15:17 2020"

This week we use the Boston data from the MASS package. The data have 14 variables and 506 observations of each variable. The variables are either numeric or integer.

# Package
library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
data(Boston)
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506  14
summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

The variables are described in the MASS package documentation for Boston.

The correlation matrix can be illustrated by using corrplot package as follows

library(corrplot)
## corrplot 0.84 loaded
corr_matrix<-cor(Boston)
corrplot(corr_matrix)

That is, there is a significant positive correlation between crim and rad, tax, and lstat. On the other hand, the weighted mean of distances to five Boston employment centres, dis, correlates negatively with indus, nox, and age.
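To read the crime-rate relationships off numerically, one can sort the crim column of the correlation matrix (a small sketch, recomputing corr_matrix from above):

```r
library(MASS)        # for the Boston data

# correlations with the crime rate, strongest first
corr_matrix <- cor(Boston)
sort(corr_matrix[, "crim"], decreasing = TRUE)
```

The top of this list (after crim itself) is dominated by rad and tax, in line with the corrplot.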

However, since the crime rate seems to be the variable of interest, let us illustrate it with some graphics.

require(ggplot2)
require(plotly)
## Loading required package: plotly
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
## 
##     select
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
plot_ly(data = Boston, x = ~crim, type = "histogram")
plot_ly(data = Boston, x = ~rad, y = ~crim)
plot_ly(data = Boston, x = ~tax, y = ~crim)
plot_ly(data = Boston, x = ~lstat, y = ~crim)

Next we standardize the dataset.

scaled.Boston <- data.frame(scale(Boston))
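As a quick check, the column means of the scaled data should be numerically zero and the standard deviations exactly one:

```r
library(MASS)        # for the Boston data
scaled.Boston <- data.frame(scale(Boston))

# means are ~0 (up to floating point); standard deviations are 1,
# since both scale() and sd() use the n-1 denominator
range(colMeans(scaled.Boston))
apply(scaled.Boston, 2, sd)
```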

Now all variables have mean \(0\) and variance \(1\). For instance, the plot of crim against lstat now looks as follows.

plot_ly(data = scaled.Boston, x = ~lstat, y = ~crim)

Next we divide crim into 5 categories based on its 20 % quantiles (quintiles).

library(gtools)
q.crim <- quantcut(scaled.Boston$crim,q = 5)
summary(q.crim)
## [-0.419,-0.413] (-0.413,-0.403] (-0.403,-0.356]  (-0.356,0.229]    (0.229,9.92] 
##             102             101             101             101             101
scaled.Boston$crim <- q.crim

We divide the dataset into train and test sets, so that 80 % of the data belongs to the train set.

sample <- sample.int(n = nrow(scaled.Boston), size = floor(.8*nrow(scaled.Boston)), replace = F)
train <- scaled.Boston[sample, ]
test  <- scaled.Boston[-sample, ]

The linear discriminant analysis:

# MASS and train are available

# linear discriminant analysis
lda.fit <- lda(crim~., data = train)

# print the lda.fit object
lda.fit
## Call:
## lda(crim ~ ., data = train)
## 
## Prior probabilities of groups:
## [-0.419,-0.413] (-0.413,-0.403] (-0.403,-0.356]  (-0.356,0.229]    (0.229,9.92] 
##       0.2103960       0.2004950       0.1856436       0.1955446       0.2079208 
## 
## Group means:
##                          zn      indus        chas        nox          rm
## [-0.419,-0.413]  1.21775653 -0.9627119 -0.04073494 -0.9360967  0.47451291
## (-0.413,-0.403] -0.08228945 -0.3604520 -0.07790436 -0.5828091  0.08707506
## (-0.403,-0.356] -0.30915715 -0.1593473  0.14762829 -0.2926223 -0.01118825
## (-0.356,0.229]  -0.42211036  0.5824967  0.17620134  0.8473678 -0.16650436
## (0.229,9.92]    -0.48724019  1.0149946 -0.13171834  1.0289602 -0.40874880
##                         age         dis         rad        tax    ptratio
## [-0.419,-0.413] -0.92544401  1.04418412 -0.70353683 -0.7335203 -0.4832093
## (-0.413,-0.403] -0.36844110  0.23833631 -0.60046672 -0.5685973 -0.1853223
## (-0.403,-0.356]  0.02495419  0.06662518 -0.48573345 -0.4356537  0.1030664
## (-0.356,0.229]   0.63935981 -0.54164278  0.09536178  0.2263908 -0.2443255
## (0.229,9.92]     0.84411276 -0.89802125  1.65960290  1.5294129  0.8057784
##                      black       lstat        medv
## [-0.419,-0.413]  0.3798922 -0.79190793  0.48801994
## (-0.413,-0.403]  0.3591934 -0.35283360  0.25362611
## (-0.403,-0.356]  0.2961959  0.05984219  0.06746983
## (-0.356,0.229]  -0.1385576  0.10660048  0.05231187
## (0.229,9.92]    -0.8321791  1.04493383 -0.88712672
## 
## Coefficients of linear discriminants:
##                 LD1          LD2          LD3         LD4
## zn      -0.06064150  0.909102546  0.815167417 -0.30785188
## indus    0.07377266 -0.241636190  0.089557223  0.38544820
## chas    -0.04372562  0.036973335 -0.009766943 -0.37059360
## nox      0.39875061 -0.842801427  1.345923605 -0.71798146
## rm       0.13448510  0.328819214 -0.258092350 -0.33956212
## age      0.15596111 -0.294379525  0.375879207 -0.39361747
## dis     -0.12029184 -0.541535217  0.464810158 -0.71258238
## rad      1.74850208  1.000597656 -0.058187120  0.56260148
## tax      0.01150452 -0.172106779 -0.149102454  0.01705955
## ptratio -0.06689079 -0.038192933 -0.098551323 -0.82591418
## black   -0.17864125  0.002611965 -0.112496628 -0.02563059
## lstat    0.27124585 -0.052097152 -0.937986408 -0.94884908
## medv    -0.06649171 -0.858984349 -0.041792506 -0.42127718
## 
## Proportion of trace:
##    LD1    LD2    LD3    LD4 
## 0.8429 0.1012 0.0497 0.0062
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

# target classes as numeric
classes <- as.numeric(train$crim)

# plot the lda results
plot(lda.fit, dimen = 2)
lda.arrows(lda.fit, myscale = 1)

Next we predict the crime categories for the test set and cross tabulate the predictions with the correct categories.

# lda.fit, correct_classes and test are available
correct_classes <- test$crim

# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)

# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
##                  predicted
## correct           [-0.419,-0.413] (-0.413,-0.403] (-0.403,-0.356]
##   [-0.419,-0.413]               7               7               3
##   (-0.413,-0.403]              13               7               0
##   (-0.403,-0.356]               2              11              12
##   (-0.356,0.229]                0               1               7
##   (0.229,9.92]                  0               0               0
##                  predicted
## correct           (-0.356,0.229] (0.229,9.92]
##   [-0.419,-0.413]              0            0
##   (-0.413,-0.403]              0            0
##   (-0.403,-0.356]              1            0
##   (-0.356,0.229]               6            8
##   (0.229,9.92]                 2           15
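The table can be summarized as a single overall accuracy, the share of the diagonal (a sketch reusing correct_classes and lda.pred from above; the train/test split is random, so the exact value varies between runs):

```r
# overall classification accuracy from the cross-tabulation
conf <- table(correct = correct_classes, predicted = lda.pred$class)
sum(diag(conf)) / sum(conf)
```

For the table shown above this gives (7 + 7 + 12 + 6 + 15) / 102, roughly 0.46; the highest-crime category is classified clearly best (15 of 17 correct).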

Let us next reload the Boston dataset and standardize it. Then we calculate the distances between the observations.

# load MASS and Boston
library(MASS)
data('Boston')

# standardization
Boston <- scale(Boston)

# euclidean distance matrix
dist_eu <- dist(Boston)

# look at the summary of the distances
summary(dist_eu)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4625  4.8241  4.9111  6.1863 14.3970
# manhattan distance matrix
dist_man <- dist(Boston, method = "manhattan")

# look at the summary of the distances
summary(dist_man)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2662  8.4832 12.6090 13.5488 17.7568 48.8618

One way to determine the number of clusters is to look at how the total within-cluster sum of squares (WCSS) behaves as the number of clusters changes. A good number of clusters is where the total WCSS drops radically. In this case, two clusters seem optimal.

# MASS, ggplot2 and Boston dataset are available
set.seed(123)

# determine the number of clusters
k_max <- 10

# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(Boston, k)$tot.withinss})

# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')

# k-means clustering
km <-kmeans(Boston, centers = 2)

# plot the Boston dataset with clusters
pairs(Boston, col = km$cluster)


(more chapters to be added similarly as we proceed with the course!)